The files megmag_data.npy and pas_vector.npy can be downloaded here (http://laumollerandersen.org/data_methods_3/megmag_data.npy) and here (http://laumollerandersen.org/data_methods_3/pas_vector.npy)
library(reticulate)
import numpy as np
import os
import matplotlib.pyplot as plt
We load megmag_data.npy and call it data using np.load. (You can use join, which can be imported from os.path, to create paths from different string segments.)
data = np.load('/Users/minaalmasi/Documents/Cognitive_Science/Methods_3/methods3_code/data_W8/megmag_data.npy')
data.shape # prints (number of repetitions, number sensors, number of time samples)
## (682, 102, 251)
There are 682 repetitions, 102 sensors and 251 time samples.
We create a vector called times that represents the time samples.
times = np.arange(-200, 804, 4) # -200 ms to 800 ms in steps of 4 ms
We compute the sensor covariance matrix (and use plt.imshow to plot it).
N = len(data) #length of data
cov_temp = np.zeros(shape = (102, 102)) #creating an empty matrix
for i in range(N): #for loop in the range of the data
    X = data[i,:,:]
    X_T = np.transpose(X)
    cov_temp += np.dot(X, X_T)
#dividing by length of data
cov = cov_temp / N
## PLOTTING covariance matrix ##
plt.figure()
plt.imshow(cov)
plt.title("1.1iii: Sensor Covariance Matrix")
plt.colorbar()
## <matplotlib.colorbar.Colorbar object at 0x1278fa130>
plt.show()
From the plot, it appears that the sensors seem to pick up different signals for the most part, since the covariance matrix plot of the sensors has a petrol/navy color in general which corresponds to a covariance of about 0. The yellow color at around 50 however indicates high covariance. This covariance is so high that it distorts the rest of the picture. This may be amended by standardising the data:
from sklearn.preprocessing import StandardScaler
N = len(data) #length of data
cov_temp = np.zeros(shape = (102, 102)) #creating an empty matrix
for i in range(N): #for loop in the range of the data
    scaler = StandardScaler()
    X = data[i,:,:]
    X_scaled = scaler.fit_transform(X)
    X_T = np.transpose(X_scaled)
    cov_temp += np.dot(X_scaled, X_T)
#dividing by length of data
cov_scaled = cov_temp / N
## PLOTTING covariance matrix ##
plt.figure()
plt.imshow(cov_scaled, vmin=-200, vmax=200)
plt.title("1.1iii: Scaled Sensor Covariance Matrix")
plt.colorbar()
## <matplotlib.colorbar.Colorbar object at 0x12a6345e0>
plt.show()
By standardising the data and setting the upper limit to a lower value (vmax = 200), the covariance between the sensors becomes more apparent. Hence, not all sensors pick up completely independent signals.
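As a sanity check on what the standardisation does, here is a minimal numpy-only sketch on synthetic data (not the MEG recordings) of what StandardScaler's fit_transform computes per column: subtract the column mean and divide by the column standard deviation, yielding columns with mean 0 and standard deviation 1:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))  # synthetic stand-in for one repetition

# equivalent of StandardScaler().fit_transform(X): per-column z-scoring
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_scaled.std(axis=0), 1.0))   # True
```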
We average over the repetition dimension using np.mean with the axis argument. (The resulting array should have two dimensions, with sensors as the first and time as the second.)
mean_teslas = np.mean(data, axis = 0) #axis = 0 indicates the first dimension (repetitions)
mean_teslas.shape #checking whether we managed to collapse the first dimension
## (102, 251)
We plot the averages, using plt.axvline and plt.axhline to mark the axes.
plt.figure()
plt.plot(times, mean_teslas.T)
plt.axvline(color = "black")
plt.axhline(color = "black")
plt.title("1.1v: Average Magnetic Field Across Repetitions")
plt.xlabel("Time (ms)")
plt.ylabel("Magnetic Field (Tesla)")
plt.show()
We use np.argmax and np.unravel_index to find the sensor that has the maximal magnetic field.
# MAXIMAL MAGNETIC FIELD IN AVG. #
np.max(mean_teslas)
## 2.7886216843591933e-13
# SENSOR with THE MAXIMAL MAGNETIC FIELD #
np.unravel_index(np.argmax(mean_teslas), mean_teslas.shape) #first argument is the index of the maximal tesla value, the second argument is the dimensions of the array
## (73, 112)
times[112] # time in ms
## 248
The maximal magnetic field in the average is 2.7886216843591933e-13.
With np.unravel_index and np.argmax, we find that the sensor with the maximal magnetic field is sensor 73, given by the index pair (73, 112). In the calculation, the time is given in time samples (112), but by indexing the times vector with 112, we get the time in ms: 248. This number fits the peak of the red sensor in the plot from 1.1v.
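To illustrate how np.unravel_index converts the flat index returned by np.argmax back into (row, column) coordinates, a tiny example on a made-up array:

```python
import numpy as np

a = np.array([[3., 9., 1.],
              [4., 7., 8.]])

flat_idx = np.argmax(a)                       # index into the flattened array
coords = np.unravel_index(flat_idx, a.shape)  # back to (row, column)

print(flat_idx)  # 1
print(coords)    # (0, 1) -> the 9. in row 0, column 1
```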
We plot the repetitions for sensor 73, using plt.axvline to mark the time of the maximal average field.
## SAVING Sensor 73 ##
sensor73 = data[:,73,:]
## PLOTTING Sensor 73 ##
plt.figure()
plt.plot(times, sensor73.T)
plt.axvline(times[112], color = "black", linestyle = "--", label = "time at avg. maximal magnetic field")
plt.legend(loc = "upper right")
plt.axvline(color = "black")
plt.axhline(color = "black")
plt.title("1.1vii: Magnetic Field Across Repetitions for Sensor 73")
plt.xlabel("Time (ms)")
plt.ylabel("Magnetic Field (Tesla)")
plt.show()
In the plot in 1.1vii, we zoom in on the recordings made by sensor 73 for all 682 repetitions (one line for each repetition). In plot 1.1vii, it is possible to see an indication of a signal/effect for sensor 73, as the range of y-values is narrower and more positive at the dotted line.
It is the average of these repetitions that is visible in 1.1v as the red line with the highest peak (sensor 73). Averaging across each repetition cancels the noise and makes the signal more clearly visible.
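The noise-cancelling effect of averaging can be sketched with synthetic data (not the MEG recordings): independent noise shrinks roughly by a factor of sqrt(N) when averaging N repetitions, while a fixed signal is preserved:

```python
import numpy as np

rng = np.random.default_rng(7)
n_rep, n_time = 682, 251

signal = np.sin(np.linspace(0, np.pi, n_time))       # fixed underlying "evoked" signal
noise = rng.normal(scale=1.0, size=(n_rep, n_time))  # independent noise per repetition
trials = signal + noise                              # each repetition = signal + noise

avg = trials.mean(axis=0)

# residual noise in the average is roughly 1/sqrt(682), i.e. about 4% of the
# single-trial noise level
print(np.std(avg - signal))
```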
We load pas_vector.npy (and call it y). PAS is the same as in Assignment 2, describing the clarity of the subjective experience the subject reported after seeing the briefly presented stimulus.
y = np.load('/Users/minaalmasi/Documents/Cognitive_Science/Methods_3/methods3_code/data_W8/pas_vector.npy')
Which dimension of our data array does it have the same length as?
# CHECKING THE length of y #
len(y)
## 682
# The dimension it has the same length as #
len(y) == len(data[:,0,0])
## True
y has the same length as the first dimension (repetitions) of our data array.
## FINDING the indexes of the pas ratings ##
pas1 = np.where(y == 1)
pas2 = np.where(y == 2)
pas3 = np.where(y == 3)
pas4 = np.where(y == 4)
## FINDING the mean TESLA values for each pas rating for all time values ##
mean_pas1 = np.mean(sensor73[pas1], axis = 0)
mean_pas1.shape
## (251,)
mean_pas2 = np.mean(sensor73[pas2], axis = 0)
mean_pas2.shape
## (251,)
mean_pas3 = np.mean(sensor73[pas3], axis = 0)
mean_pas3.shape
## (251,)
mean_pas4 = np.mean(sensor73[pas4], axis = 0)
mean_pas4.shape
## (251,)
## PLOTTING the mean TESLA values for sensor 73 for each pas rating ##
plt.figure()
plt.plot(times, mean_pas1.T, color = "blue", label = "pas1")
plt.plot(times, mean_pas2.T, color = "green", label = "pas2")
plt.plot(times, mean_pas3.T, color = "orange", label = "pas3")
plt.plot(times, mean_pas4.T, color = "red", label = "pas4")
plt.axvline(color = "black")
plt.axhline(color = "black")
plt.legend()
plt.title("1.2ii: Average Magnetic Field Across Repetitions for Sensor 73")
plt.xlabel("Time (ms)")
plt.ylabel("Magnetic Field (Tesla)")
plt.show()
Now we create a data set data_1_2 that only contains PAS responses 1 and 2, and similarly a y_1_2 for the target vector.
## CREATING a new array ##
pas1_2 = np.where((y==1)|(y==2))
data_1_2 = data[pas1_2]
data_1_2.shape #checking that we did it right
## (214, 102, 251)
## target vector ##
y_1_2 = y[(y==1)|(y==2)]
We need our data (data_1_2) to be in a 2d-array, which has samples (repetitions) on dimension 1 and features (predictor variables) on dimension 2. Our data_1_2 is a three-dimensional array. Our strategy is to collapse the two last dimensions (sensors and time) into one dimension, while keeping the first dimension as it is (repetitions). We use np.reshape to create a variable X_1_2 that fulfils these criteria.
## RESHAPING to create X_1_2 ##
X_1_2 = data_1_2.reshape(data_1_2.shape[0],-1)
X_1_2.shape
## (214, 25602)
data_1_2.shape[0] is repetitions with pas1 or pas2. This is the dimension we want to keep. -1 collapses the second (sensors) and third (time) dimension.
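A minimal example of this reshape on a toy (repetitions, sensors, time) array:

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)  # toy array: (repetitions, sensors, time)
b = a.reshape(a.shape[0], -1)       # keep repetitions; numpy infers 3*4 = 12 for the -1

print(b.shape)                              # (2, 12)
print(np.array_equal(b[0], a[0].ravel()))   # True: row i is repetition i, flattened
```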
We use StandardScaler to scale X_1_2.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
## SCALING X_1_2 ##
X_1_2_scaled = scaler.fit_transform(X_1_2)
X_1_2_scaled.shape
## (214, 25602)
We fit a LogisticRegression - imported from sklearn.linear_model - making sure that no penalty is applied.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='none')
## FITTING a logistic regression to our scaled data ##
lr.fit(X_1_2_scaled,y_1_2)
## LogisticRegression(penalty='none')
We use the score method of LogisticRegression to find out how many labels were classified correctly. Are we overfitting? Besides the score, what would make us suspect that we are overfitting?
lr.score(X_1_2_scaled, y_1_2)
## 1.0
Our logistic regression correctly classifies 100% of the labels (the output is 1.0), which makes us suspect that something is wrong. Considering this perfect accuracy, the model is clearly overfitting. This happens because we have not split the data into training and test sets, so the model “knows” the correct classifications: it has been trained on the very data it is asked to predict. Besides the score, the fact that we have far more features (25602) than samples (214) would make us suspect overfitting, since the classes are then almost guaranteed to be perfectly separable.
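A sketch of the point with synthetic data (random labels and more features than samples; all names and sizes here are made up for the illustration, and a very large C stands in for "no penalty"): such a model can fit the training labels perfectly even when they are pure noise, while its score on held-out data falls back to chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(214, 300))   # more features than samples, as in our data
y = rng.integers(1, 3, size=214)  # random "pas" labels: nothing to learn

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

scaler = StandardScaler().fit(X_train)         # fit the scaler on the training set only
lr = LogisticRegression(C=1e6, max_iter=5000)  # very large C ~ effectively no penalty
lr.fit(scaler.transform(X_train), y_train)

print(lr.score(scaler.transform(X_train), y_train))  # perfect fit to noise
print(lr.score(scaler.transform(X_test), y_test))    # back to ~chance
```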
Now we apply an L1 penalty instead. How many of the coefficients (.coef_) are non-zero after this?
## APPLYING L1 ##
lr = LogisticRegression(penalty='l1', solver ='liblinear', random_state=1)
lr.fit(X_1_2_scaled,y_1_2)
## LogisticRegression(penalty='l1', random_state=1, solver='liblinear')
## COUNTING the non-zero coefficients ##
np.count_nonzero(lr.coef_)
## 282
We create a reduced X that only includes the features with non-zero coefficients and plot their covariance with plt.imshow. Compared to the plot from 1.1.iii, do we see less covariance?
(MS): Firstly, we subset X to only include the non-zero coefficients:
## REDUCED X ##
non_zero = np.where(lr.coef_ != 0)
## TAKING the previous X_1_2_scaled and subsetting the non-zero coefficients
reduced_X = X_1_2_scaled[:,non_zero[1]]
reduced_X.shape
## (214, 282)
Now we make the matrix. We would like to end up with a 282 by 282 matrix. When we multiply an \(m \times n\) matrix with an \(n \times p\) matrix, the product is \(m \times p\). We therefore multiply the transposed matrix (282 rows) with the original matrix (282 columns): \(X_{reduced}^T X_{reduced}\), i.e. \(features \times features\).
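The shape bookkeeping can be checked on random stand-in data with the same dimensions (synthetic, just to verify the shapes):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(214, 282))  # (samples, features)

G = X.T @ X                 # (282, 214) @ (214, 282)
print(G.shape)              # (282, 282): features x features
print(np.allclose(G, G.T))  # True: the product is symmetric
```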
(DB):
## COVARIANCE of Non-zero Features ##
reduced_X_cov = reduced_X.T@reduced_X
reduced_X_cov.shape
## (282, 282)
## PLOTTING the covariance plot ##
plt.figure()
plt.imshow(reduced_X_cov)
plt.title("2.1vii: Non-Zero Coefficients Covariance Matrix")
plt.colorbar()
## <matplotlib.colorbar.Colorbar object at 0x12a693730>
plt.show()
## PLOTTING 1.1iii (scaled) ##
plt.figure()
plt.imshow(cov_scaled, vmin=-200, vmax=200)
plt.title("1.1iii: Scaled Sensor Covariance Matrix")
plt.colorbar()
## <matplotlib.colorbar.Colorbar object at 0x12a63b3d0>
plt.show()
Compared to the scaled plot from 1.1iii, we see less covariance in the 2.1vii plot. This is due to our selection of only the non-zero features: the L1 penalty has removed features that do not contribute to explaining additional variance, so the remaining features tend to explain different parts of the variance. For this reason, we expect and see less covariance in the non-zero plot.
We import cross_val_score and StratifiedKFold from sklearn.model_selection.
from sklearn.model_selection import cross_val_score, StratifiedKFold
We create y_1_2_equal, which should have an equal number of each target, and a similar X_1_2_equal, using the function equalize_targets_binary in the code chunk associated with Exercise 2.2.ii. Remember to scale X_1_2_equal!
# Exercise 2.2.ii
def equalize_targets_binary(data, y):
    np.random.seed(7)
    targets = np.unique(y) ## find the number of targets
    if len(targets) > 2:
        raise NameError("can't have more than two targets")
    counts = list()
    indices = list()
    for target in targets:
        counts.append(np.sum(y == target)) ## find the number of each target
        indices.append(np.where(y == target)[0]) ## find their indices
    min_count = np.min(counts)
    # randomly choose trials
    first_choice = np.random.choice(indices[0], size=min_count, replace=False)
    second_choice = np.random.choice(indices[1], size=min_count, replace=False)
    # create the new data sets
    new_indices = np.concatenate((first_choice, second_choice))
    new_y = y[new_indices]
    new_data = data[new_indices, :, :]
    return new_data, new_y
## USING the function ##
data_1_2_equal, y_1_2_equal = equalize_targets_binary(data_1_2, y_1_2)
## INVESTIGATING y_1_2_equal & data_1_2_equal ##
y_1_2_equal.shape
## (198,)
data_1_2_equal.shape
## (198, 102, 251)
## RESHAPING ##
X_1_2_equal = data_1_2_equal.reshape(data_1_2_equal.shape[0],-1)
X_1_2_equal.shape
## (198, 25602)
## SCALING ##
scaler = StandardScaler()
X_1_2_equal_scaled = scaler.fit_transform(X_1_2_equal)
## CHECKING that we did it right ##
X_1_2_equal_scaled.shape
## (198, 25602)
Running X_1_2_equal_scaled.shape, we see that we now have 198 pas ratings (corresponding to 99 pas1 and 99 pas2 ratings) and the 25602 features.
We cross-validate a LogisticRegression (see Exercise 2.1.iv).
## FITTING the model ##
log = LogisticRegression(penalty='none')
log.fit(X_1_2_equal_scaled, y_1_2_equal)
## LogisticRegression(penalty='none')
## CROSS-VALIDATION ##
cv = StratifiedKFold(n_splits=5) #defining that we want 5 folds
scores_log = cross_val_score(log, X_1_2_equal_scaled, y_1_2_equal, cv=cv)
print("Cross-Validation Score (no penalty):", round(np.mean(scores_log), 3))
## Cross-Validation Score (no penalty): 0.535
Our model is barely above chance level (0.535), taking the mean of the prediction accuracies across the 5 cross-validation folds.
We fit logistic regressions with L2 penalties and Cs = [1e5, 1e1, 1e-5], using the same kind of cross-validation as in Exercise 2.2.iii. In the best-scoring of these models, how many more/fewer predictions are correct (on average)?
(DB):
#LOG 1e5
log_1e5 = LogisticRegression(penalty='l2', C=1e5)
log_1e5.fit(X_1_2_equal_scaled, y_1_2_equal)
## LogisticRegression(C=100000.0)
scores_log_1e5 = cross_val_score(log_1e5, X_1_2_equal_scaled, y_1_2_equal, cv=cv)
#LOG 1e1
log_1e1 = LogisticRegression(penalty='l2', C=1e1)
log_1e1.fit(X_1_2_equal_scaled, y_1_2_equal)
## LogisticRegression(C=10.0)
scores_log_1e1 = cross_val_score(log_1e1, X_1_2_equal_scaled, y_1_2_equal, cv=cv)
#LOG 1e-5
log_1e_neg5 = LogisticRegression(penalty='l2', C=1e-5)
log_1e_neg5.fit(X_1_2_equal_scaled, y_1_2_equal)
## LogisticRegression(C=1e-05)
scores_log_1e_neg5 = cross_val_score(log_1e_neg5, X_1_2_equal_scaled, y_1_2_equal, cv=cv)
## CROSS-VALIDATION Score ##
print("Cross-Validation Score w. C = 1e5:", np.mean(scores_log_1e5).round(3))
## Cross-Validation Score w. C = 1e5: 0.535
print("Cross-Validation Score w. C = 1e1:", np.mean(scores_log_1e1).round(3))
## Cross-Validation Score w. C = 1e1: 0.525
print("Cross-Validation Score w. C = 1e-5:", np.mean(scores_log_1e_neg5).round(3))
## Cross-Validation Score w. C = 1e-5: 0.596
The model with the lowest C (\(C = 1e-5\)) has the highest accuracy (0.596), meaning that introducing penalty/bias improves prediction accuracy. We can count the number of correct predictions and compare with the no-penalty model made in 2.2iii (which only had an accuracy of 0.535).
(MS):
from sklearn.model_selection import cross_val_predict
## ACCURACY Log 1e-5 (best scoring) ##
predict_log1e_neg5 = cross_val_predict(log_1e_neg5, X_1_2_equal_scaled, y_1_2_equal, cv=cv)
accuracy_log1e_neg5 = predict_log1e_neg5 == y_1_2_equal # list of all true & false classifications
## ACCURACY Log Model 2.2iii ##
predict_log = cross_val_predict(log, X_1_2_equal_scaled, y_1_2_equal, cv=cv)
accuracy_log = predict_log == y_1_2_equal
## COUNTING correct predictions ##
print("Correct Predictions Log 2.2iii:", len(np.where(accuracy_log == True)[0]))
## Correct Predictions Log 2.2iii: 106
print("Correct Predictions Log 1e-5:", len(np.where(accuracy_log1e_neg5 == True)[0]))
## Correct Predictions Log 1e-5: 118
The number of correct predictions with the penalised model (log 1e-5) is 118, whereas the non-penalised model (from 2.2iii) only has 106.
Instead of using all n_sensors * n_samples features, we now fit a logistic regression (same kind as in Exercise 2.2.iv, using the C that resulted in the best prediction) for each time sample, with the same cross-validation as in Exercise 2.2.iii. What are the time points where classification is best? We make a plot with time on the x-axis and classification score on the y-axis with a horizontal line at the chance level (what is the chance level for this analysis?).
## EMPTY list for the cross scores ##
cross_scores = []
## DEFINING cv again for good measure ##
cv = StratifiedKFold(n_splits=5)
for i in range(251):
    #Creating data and scaling
    scaler = StandardScaler()
    X_time = data_1_2_equal[:,:,i]
    X_time_scaled = scaler.fit_transform(X_time)
    #Creating a logistic regression object
    lr = LogisticRegression(penalty='l2', C=1e-5)
    #Cross-validating
    score = cross_val_score(lr, X_time_scaled, y_1_2_equal, cv = cv)
    #taking the mean
    mean = np.mean(score)
    #appending the mean
    cross_scores.append(mean)
## FINDING the time point where classification is best ##
indexmax = cross_scores.index(max(cross_scores))
times[indexmax]
## 232
plt.figure()
plt.axvline(x = times[indexmax], color = "black", alpha = 0.5)
plt.plot(times, cross_scores)
plt.axhline(y = 0.50, color = "black")
plt.title("L2 PAS 1 & 2: Classification Accuracy vs. Time")
plt.xlabel("Time (ms)")
plt.ylabel("Accuracy")
plt.show()
The chance level for this analysis is 50% since it is a binary classification: a trial can only be classified as \(pas1\) or \(pas2\). Hence, the classification is not much better than chance level.
We repeat the analysis with an L1 penalty and C=1e-1 - what are the time points when classification is best? (make a plot)
## EMPTY list for the cross scores ##
cross_scores_l1 = []
## DEFINING cv again for good measure ##
cv = StratifiedKFold(n_splits=5)
for i in range(251):
    #Creating data and scaling
    scaler = StandardScaler()
    X_time = data_1_2_equal[:,:,i]
    X_time_scaled = scaler.fit_transform(X_time)
    #Creating a logistic regression object
    lr = LogisticRegression(penalty='l1', solver = "liblinear", C=1e-1)
    #Cross-validating
    score = cross_val_score(lr, X_time_scaled, y_1_2_equal, cv = cv)
    #taking the mean
    mean = np.mean(score)
    #appending the mean
    cross_scores_l1.append(mean)
## FINDING the time point where classification is best ##
indexmax_l1 = cross_scores_l1.index(max(cross_scores_l1))
times[indexmax_l1]
## 244
plt.figure()
plt.axvline(x = times[indexmax_l1], color = "black", alpha = 0.5)
plt.plot(times, cross_scores_l1)
plt.axhline(y = 0.50, color = "black")
plt.title("L1 PAS 1 & 2: Classification Accuracy vs. Time")
plt.xlabel("Time (ms)")
plt.ylabel("Accuracy")
plt.show()
For both the L1 and the L2 plot, the time point where classification is best is around 230-240 ms.
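The per-timepoint loop is repeated several times in this assignment; here is a sketch of a reusable helper (the function name timepoint_scores is our own, not from the assignment) that wraps the same steps — slice one time sample, scale, cross-validate, store the mean score — demonstrated on small synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

def timepoint_scores(data, y, make_model, n_splits=5):
    """Mean cross-validated accuracy per time sample of a (reps, sensors, time) array."""
    cv = StratifiedKFold(n_splits=n_splits)
    scores = []
    for i in range(data.shape[2]):
        X = StandardScaler().fit_transform(data[:, :, i])
        scores.append(np.mean(cross_val_score(make_model(), X, y, cv=cv)))
    return np.array(scores)

# tiny synthetic demo: 40 repetitions, 10 sensors, 8 time samples
rng = np.random.default_rng(7)
demo_data = rng.normal(size=(40, 10, 8))
demo_y = np.repeat([1, 2], 20)

scores = timepoint_scores(demo_data, demo_y,
                          lambda: LogisticRegression(penalty='l1', solver='liblinear', C=1e-1))
print(scores.shape)  # (8,)
```

Passing a model factory (make_model) rather than a fitted estimator ensures each time sample gets a fresh, unfitted classifier.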
We create data_1_4 and y_1_4 (a data set and a target vector that only contain PAS responses 1 and 4). What are the time points when classification is best? We make a plot with time on the x-axis and classification score on the y-axis with a horizontal line at the chance level (what is the chance level for this analysis?).
(MA): Firstly, we create a target vector with pas1 and pas4, similar to the one we made in exercise 2.1i:
pas1_4 = np.where((y==1)|(y==4))
data_1_4 = data[pas1_4]
## CREATING the target vector ##
y_1_4 = y[(y==1)|(y==4)]
Now we equalize the data:
## USING the function from 2.2ii ##
data_1_4_equal, y_1_4_equal = equalize_targets_binary(data_1_4, y_1_4)
Finally, we cross-validate for each time stamp:
## EMPTY list for the cross scores ##
cross_scores_pas1_4 = []
## DEFINING cv again for good measure ##
cv = StratifiedKFold(n_splits=5)
for i in range(251):
    #Creating data and scaling
    scaler = StandardScaler()
    X_time = data_1_4_equal[:,:,i]
    X_time_scaled = scaler.fit_transform(X_time)
    #Creating a logistic regression object
    lr = LogisticRegression(penalty='l1', solver = "liblinear", C=1e-1)
    #Cross-validating
    score = cross_val_score(lr, X_time_scaled, y_1_4_equal, cv = cv)
    #taking the mean
    mean = np.mean(score)
    #appending the mean
    cross_scores_pas1_4.append(mean)
(MS):
## FINDING the time point where classification is best ##
indexmax_pas_1_4 = cross_scores_pas1_4.index(max(cross_scores_pas1_4))
times[indexmax_pas_1_4]
## 236
plt.figure()
plt.axvline(x = times[indexmax_pas_1_4], color = "black", alpha = 0.5)
plt.plot(times, cross_scores_pas1_4)
plt.axhline(y = 0.50, color = "black")
plt.title("L1 PAS 1 & 4: Classification Accuracy vs. Time")
plt.xlabel("Time (ms)")
plt.ylabel("Accuracy")
plt.show()
The chance level for this analysis is again 50% as it is once again a binary classification.
The time point where the classification is best is 236 ms for the L1 PAS1 & PAS4 logistic regression, which is quite close to the previous models.
Theoretically, there should be a greater difference between PAS1 and PAS4 than between PAS1 and PAS2, since these ratings describe very different subjective experiences. However, our pairwise comparisons in exercise 2.2 give quite similar results.
Another surprise has been the accuracy of our models in general, which is close to chance level.
(MA): i. First we equalize the number of targets for each PAS-rating using the function associated with Exercise 3.1.i:
def equalize_targets(data, y):
    np.random.seed(7)
    targets = np.unique(y)
    counts = list()
    indices = list()
    for target in targets:
        counts.append(np.sum(y == target))
        indices.append(np.where(y == target)[0])
    min_count = np.min(counts)
    first_choice = np.random.choice(indices[0], size=min_count, replace=False)
    second_choice = np.random.choice(indices[1], size=min_count, replace=False)
    third_choice = np.random.choice(indices[2], size=min_count, replace=False)
    fourth_choice = np.random.choice(indices[3], size=min_count, replace=False)
    new_indices = np.concatenate((first_choice, second_choice,
                                  third_choice, fourth_choice))
    new_y = y[new_indices]
    new_data = data[new_indices, :, :]
    return new_data, new_y
data_all_pas_eq, all_pas_y_eq = equalize_targets(data, y)
Firstly, we reshape and scale our data:
## RESHAPING ##
X_all_eq = data_all_pas_eq.reshape(data_all_pas_eq.shape[0],-1)
X_all_eq.shape
## (396, 25602)
## SCALING ##
scaler = StandardScaler()
X_all_eq_scaled = scaler.fit_transform(X_all_eq)
X_all_eq_scaled.shape
## (396, 25602)
from sklearn.svm import SVC
## CREATING the classifier with a linear kernel ##
Lsvc = SVC(kernel = 'linear')
## CREATING the classifier with a radial basis ##
Rsvc = SVC(kernel = 'rbf')
To compare the prediction accuracy of the two classifiers, we cross-validate them:
## DEFINING cv again for good measure ##
cv = StratifiedKFold(n_splits=5)
## CROSS-VALIDATING the linear support vector ##
Lsvc_scores = cross_val_score(Lsvc, X_all_eq_scaled, all_pas_y_eq, cv=cv)
## CROSS-VALIDATING the radial support vector ##
Rsvc_scores = cross_val_score(Rsvc, X_all_eq_scaled, all_pas_y_eq, cv=cv)
## printing the mean of the cross-validated performances ##
print("Lsvc Mean Cross Validated:", round(np.mean(Lsvc_scores), 3))
## Lsvc Mean Cross Validated: 0.293
print("Rsvc Mean Cross Validated:", round(np.mean(Rsvc_scores), 3))
## Rsvc Mean Cross Validated: 0.333
The mean accuracy is higher for our support vector machine with a radial basis kernel (0.333) although both accuracies are quite low (slightly above chance level at 25 %).
(ADS):
## EMPTY list for the cross scores ##
cross_scores_Rsvc = []
## DEFINING cv again for good measure ##
cv = StratifiedKFold(n_splits=5)
for i in range(251):
    #Creating data and scaling
    scaler = StandardScaler()
    X_time = data_all_pas_eq[:,:,i]
    X_time_scaled = scaler.fit_transform(X_time)
    #Instantiating a support vector machine with a radial basis
    Rsvc = SVC(kernel = 'rbf')
    #Cross-validating
    score = cross_val_score(Rsvc, X_time_scaled, all_pas_y_eq, cv = cv)
    #taking the mean
    mean = np.mean(score)
    #appending the mean
    cross_scores_Rsvc.append(mean)
(MA):
## FINDING the time point where classification is best ##
indexRsvc = cross_scores_Rsvc.index(max(cross_scores_Rsvc))
times[indexRsvc]
## 704
plt.figure()
plt.axvline(x = times[indexRsvc], color = "black", alpha = 0.7)
plt.axhline(y = 0.25, color = "black") #horizontal line at chance level
plt.plot(times, cross_scores_Rsvc)
plt.title("SVC with Radial Basis on all PAS: Classification Accuracy vs. Time")
plt.xlabel("Time (ms)")
plt.ylabel("Accuracy")
plt.show()
Since it is a multinomial classification with 4 categories, the chance level is 25 %.
Looking at the plot from 3.1iii, we note that classification of PAS ratings at around 200-250 ms is above chance level. Hence, it could be argued that classification of subjective experience is possible in this time interval, though far from perfectly accurate.
We import train_test_split from sklearn.model_selection.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_all_eq_scaled, all_pas_y_eq, test_size=0.30, random_state=42) #we save to four different variables, the random_state ensures that it randomly sorts the data-set into train and test
We fit on the training set and predict on the test set. This time our features are the number of sensors multiplied by the number of samples.
## INSTANTIATING kernel ##
Rsvc = SVC(kernel = 'rbf')
## FITTING ##
Rsvc.fit(X_train, y_train)
## SVC()
## PREDICTING ##
predicted_y = Rsvc.predict(X_test)
acc = list(y_test == predicted_y)
## PROPORTION of correctly predicted pas-scores ##
print("Proportion of correctly predicted pas:", round(acc.count(True)/len(acc), 3))
## Proportion of correctly predicted pas: 0.277
Performance is just above chance level (0.277).
## CONFUSION matrix ##
import pandas as pd
predicted_y = pd.Series(predicted_y, name = 'Predicted PAS')
y_test = pd.Series(y_test, name = 'Actual PAS')
print(pd.crosstab(y_test, predicted_y))
## Predicted PAS 1 2 3 4
## Actual PAS
## 1 4 5 16 12
## 2 1 8 12 7
## 3 1 4 14 7
## 4 2 5 14 7
(MA): We see an exaggerated representation in the 3rd column (where pas is predicted to be 3), indicating a bias. No matter the actual pas-rating, the model will most frequently predict the pas-rating to be 3 (for each row, the highest value is in the 3rd column). Misclassifications are most frequent when the actual pas-rating was 1 where 89% of trials are misclassified (see calculations below).
(DB): We imagine that pas3 (almost clear experience) is the “broadest” of the rating definitions thereby allowing participants to classify a broader variety of experiences as pas3 (contingent on their definition of “almost”). We could therefore expect that pas3 ratings are more scattered across the multidimensional space, making them harder for the SVM to isolate pas3 from the other ratings. Thus, pas3 is the obvious/safe guess in most cases for our model (which only performs slightly better than chance).
#MISCLASSIFICATIONS
print("Pas1 Misclassification:", round((5+16+12)/(4+5+16+12)*100, 3)) #pas1
## Pas1 Misclassification: 89.189
print("Pas2 Misclassification:", round((1+12+7)/(1+8+12+7)*100, 3)) #pas2
## Pas2 Misclassification: 71.429
print("Pas3 Misclassification:", round((1+4+7)/(1+4+14+7)*100, 3)) #pas3
## Pas3 Misclassification: 46.154
print("Pas4 Misclassification:", round((2+5+14)/(2+5+14+7)*100, 3)) #pas4
## Pas4 Misclassification: 75.0
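As a cross-check, the per-class misclassification rates can also be computed directly from the confusion matrix in 3.2ii rather than typed in by hand:

```python
import numpy as np

# confusion matrix from 3.2ii: rows are actual PAS 1-4, columns are predicted PAS 1-4
cm = np.array([[4, 5, 16, 12],
               [1, 8, 12, 7],
               [1, 4, 14, 7],
               [2, 5, 14, 7]])

# misclassification rate per class = 1 - correct / total for each row
mis = 1 - np.diag(cm) / cm.sum(axis=1)
print(np.round(mis * 100, 3))  # pas1-4 misclassification in %: 89.189, 71.429, 46.154, 75.0
```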